Jupyter for Data Science: Exploratory analysis, statistical modeling, machine learning, and data visualization with Jupyter by Dan Toomey

Author: Dan Toomey [Toomey, Dan]
Language: eng
Format: epub
Tags: COM018000 - COMPUTERS / Data Processing, COM062000 - COMPUTERS / Data Modeling and Design, COM089000 - COMPUTERS / Data Visualization
Publisher: Packt Publishing
Published: 2017-10-19T23:00:00+00:00


Product: Product ID, Description, Price
Order: Order ID, Order Date
ProductOrder: Order ID, Product ID, Quantity

So, each Order is associated with a list of Product/Quantity pairs through the ProductOrder table.
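The loading code in this section reads product.csv, order.csv, and orderproduct.csv with a header row. The sketch below shows one way such files could be produced for experimentation. Every data row is hypothetical, and so are the column names except orderid, which the join query relies on; the helper write_sample_csvs is introduced here for illustration only and is not from the book.

```python
import csv
from pathlib import Path

# Hypothetical sample data. Only the orderid column name is taken from the
# join query in this section; every other name and value is an assumption.
SAMPLE_TABLES = {
    "product.csv": [
        ["productid", "description", "price"],
        [1, "Widget", "9.99"],
        [2, "Gadget", "24.50"],
    ],
    "order.csv": [
        ["orderid", "orderdate"],
        [100, "2017-10-01"],
        [101, "2017-10-02"],
    ],
    "orderproduct.csv": [
        ["orderid", "productid", "quantity"],
        [100, 1, 3],
        [100, 2, 1],
        [101, 2, 5],
    ],
}


def write_sample_csvs(directory="."):
    """Write the three sample CSV files into directory and return their paths."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, rows in SAMPLE_TABLES.items():
        path = out_dir / name
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths
```

Calling write_sample_csvs() in the notebook's working directory would place the files where the relative paths passed to load() expect them. Note that order 100 appears twice in orderproduct.csv, which is how one order carries several product/quantity pairs.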

We can load each table from its CSV file into a Spark data frame and register it as a temporary view:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# load product set
productDF = spark.read.format("csv") \
    .option("header", "true") \
    .load("product.csv")
productDF.show()
productDF.createOrReplaceTempView("product")

# load order set
orderDF = spark.read.format("csv") \
    .option("header", "true") \
    .load("order.csv")
orderDF.show()
orderDF.createOrReplaceTempView("order")

# load order/product set
orderproductDF = spark.read.format("csv") \
    .option("header", "true") \
    .load("orderproduct.csv")
orderproductDF.show()
orderproductDF.createOrReplaceTempView("orderproduct")

Now, we can perform a SQL JOIN between the orderproduct and order views:

# join the tables
joinedDF = spark.sql("SELECT * " \
    "FROM orderproduct " \
    "JOIN order ON order.orderid = orderproduct.orderid " \
    "ORDER BY order.orderid")
joinedDF.show()
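To get a feel for the shape of the joined result without a Spark installation, the same join can be tried against an in-memory SQLite database. This is only a rough sketch and not the book's method: the rows are hypothetical, the column names other than orderid are assumptions, the table name has to be quoted because ORDER is a reserved word in SQLite, and a second sort key (productid) is added so that the row order is deterministic.

```python
import sqlite3

# In-memory database with hypothetical rows for the two joined tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "order" (orderid INTEGER, orderdate TEXT);
    CREATE TABLE orderproduct (orderid INTEGER, productid INTEGER, quantity INTEGER);
    INSERT INTO "order" VALUES (100, '2017-10-01'), (101, '2017-10-02');
    INSERT INTO orderproduct VALUES (100, 1, 3), (100, 2, 1), (101, 2, 5);
""")

# Same inner join as above; "order" is quoted because it is a reserved word here.
cursor = conn.execute(
    'SELECT * '
    'FROM orderproduct '
    'JOIN "order" ON "order".orderid = orderproduct.orderid '
    'ORDER BY "order".orderid, orderproduct.productid'
)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Because the query uses SELECT * with an ON clause, orderid appears twice in the result, once from each table. Each orderproduct row is paired with its matching order row, so an order with two products yields two joined rows.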

Doing all of this in Jupyter results in the display as follows:


